About the Project

This is my personal project page for the University of Helsinki statistics course “Introduction to Open Data Science”.

About Me

My name is Pinja-Liina Jalkanen.

Please be forewarned that my statistics skills are rather crappy and usually limited to some simple descriptives and regressions. I do know some tech stuff quite well, though, so I don’t expect great difficulties with the R syntax, Git or the like. Kimmo Vehkalahti said on the first lesson of the course that data analysis is often 90% preparation of the data and 10% doing the analysis. I don’t know about the weighting on this course, but I usually don’t need help with the 90% at all; that’s mostly RTFM, STFW and the like anyway.

With regard to the 10% though, I don’t trust my skills at all.

Link to this project: https://github.com/pinjaliina/IODS-project


Analysis of the Learning Questionnaire

Introduction

The data analysed in this exercise was the Autumn 2014 Introductory Statistics Course Learning Questionnaire. The data has 60 variables and 183 observations; apart from a few background-related variables (age, attitude towards statistics, course points), most of the variables are learning-related questions answered on a Likert scale from 1 to 5.

For the analysis, I’ve averaged the variables related to deep learning, surface learning, and strategic learning. (For the initial data wrangling part of this exercise, see this R script).
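The averaging itself happens in the linked script; the idea can be sketched with rowMeans(). Note that the item names and values below are purely illustrative stand-ins, not the actual question groupings used in the script:

```r
# Hypothetical subset of deep-learning Likert items (illustrative names only).
deep_items <- c("D03", "D06", "D07")
lrn <- data.frame(D03 = c(4, 5), D06 = c(3, 4), D07 = c(5, 5))
# Average the items row-wise to get one deep-learning score per student.
lrn$deep <- rowMeans(lrn[, deep_items])
lrn$deep
```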

Overview of the Data

The data is read in as follows:

# Read in the data
lrn2014a <- as.data.frame(read.table('data/learning2014.csv',  sep="\t", header=TRUE))

A scatter plot matrix of the variables of the data can be drawn as follows, coloured by the gender variable:

# ggpairs() is from the GGally package, loaded in the init section of this file.
p <- ggpairs(lrn2014a, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

A summary of each of the variables of the data can be displayed as follows:

summary(lrn2014a)
##       age           attitude         points      gender 
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   F:110  
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   M: 56  
##  Median :22.00   Median :32.00   Median :23.00          
##  Mean   :25.51   Mean   :31.43   Mean   :22.72          
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75          
##  Max.   :55.00   Max.   :50.00   Max.   :33.00          
##       deep            stra            surf      
##  Min.   :1.625   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.500   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.875   Median :2.833   Median :3.188  
##  Mean   :3.796   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.250   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.875   Max.   :4.333   Max.   :5.000

As can be seen from the plot and the summary output, most of the variables – with the exception of the students’ age – show no clear pattern and correlate only weakly with the course points. The most notable exception is the attitude towards statistics, which correlates with the course points more strongly than any other variable. Except for age and points, the distributions of most variables seem reasonably close to normal. The gender variable shows that significantly more female than male students attended the course.

Fitting of a Regression Model

A linear regression model can be fit to the data as follows:

lrn2014_model <- lm(points ~ attitude + stra + surf, data = lrn2014a)

The chosen explanatory variables for the model were attitude, strategic learning and surface learning. The summary of the model can be printed as follows:

summary(lrn2014_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra        -0.58607    0.80138  -0.731  0.46563    
## surf         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

However, as can be seen from the model summary, the estimates of strategic and surface learning are not statistically significant in explaining the course points; given those variables’ weak correlation with the points, this was somewhat expected. It thus makes more sense to eliminate at least the variable with the highest p-value, strategic learning:

lrn2014_model <- lm(points ~ attitude + surf, data = lrn2014a)
summary(lrn2014_model)
## 
## Call:
## lm(formula = points ~ attitude + surf, data = lrn2014a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## surf         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Analysis of the Summary

With strategic learning removed, the surface learning estimate improves to near-significance (p ≈ 0.09), so it can be kept in the model. In practice its effect on the model is so tiny that it might nearly as well be left out, but because it is reasonably close to statistical significance and also yields the highest adjusted R² – the adjusted R² of a model with attitude as the only explanatory variable, as reported by summary(), would be 0.1856 – its inclusion in the model is justified.

While the model estimates are statistically significant, the model as a whole is not very good: the multiple R² is only 0.20, which means that about 80% of the variance of the dependent variable – course points – remains unexplained by the explanatory variables. Thus, any predictions based on the model alone might not be very reliable.
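The adjusted R² figures quoted above can be checked directly; a minimal sketch, assuming lrn2014a has been read in as shown earlier:

```r
# Adjusted R-squared of the two candidate models, extracted from summary().
m_two <- lm(points ~ attitude + surf, data = lrn2014a)
m_one <- lm(points ~ attitude, data = lrn2014a)
summary(m_two)$adj.r.squared  # 0.1951 per the summary above
summary(m_one)$adj.r.squared  # 0.1856 as quoted in the text
```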

Diagnostic Plots and Assumptions of the Model

In addition to linearity, linear models are fitted with the assumption that:

  1. The errors of the model are normally distributed.
  2. The errors are not correlated.
  3. The size of the errors does not depend on the explanatory variables.

The validity of these assumptions can be tested by analysing the residuals of the model. This can be done with the help of different kinds of diagnostic plots. In the following figure, three different plots are drawn:

  • A residuals vs. fitted values plot
  • A Q–Q plot
  • A residuals vs. leverage plot
par(mfrow = c(1,3)) # Set some graphical params.
# It seems that the drawing order of the plot is independent of the vector
# order and can't be changed.
plot(lrn2014_model, which = c(1,2,5))

Interpretation of the plots:

  1. The Q–Q plot demonstrates that the standardised residuals of the model fit the theoretical quantiles reasonably well, so the normal distribution assumption is valid.
  2. The residuals vs. fitted values plot doesn’t show any kind of pattern, so the errors are not correlated and their size is independent of the explanatory variables (their σ² is constant).
  3. In addition to checking the above assumptions, it is recommended to check that no single observation has an outsized effect on the model, because such observations might distort the model coefficients. As the x-axis scale of the residuals vs. leverage plot is relatively narrow and shows no significant outliers, we can conclude that the model is not distorted by any single observation.
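As a numeric complement to the plots (my own addition, not required by the exercise), the normality assumption can also be checked with a Shapiro–Wilk test on the residuals, assuming the model object fitted above:

```r
# Shapiro-Wilk normality test of the model residuals;
# a large p-value is consistent with the normality assumption.
shapiro.test(resid(lrn2014_model))
```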

Analysis of Alcohol Consumption

Introduction

The data analysed in this exercise came from Fabio Pagnotta’s and Hossain Mohammad Amran’s Using Data Mining To Predict Secondary School Student Alcohol Consumption (2008), published by the Department of Computer Science of the University of Camerino (link: https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION). For the analysis, I’ve also created a combined total alcohol consumption variable (the sum of weekday and weekend consumption) and a separate logical high-use variable based on the total consumption. (For the initial data wrangling part of this exercise, see this R script).

Overview of the Data

The data is read in as follows:

# Read in the data
alc <- as.data.frame(read.table('data/alc.csv',  sep="\t", header=TRUE))

A glimpse of all of the variables of the data can be displayed as follows:

# glimpse() is from the dplyr package.
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob       <fctr> teacher, other, other, services, other, other...
## $ reason     <fctr> course, course, other, home, home, reputation...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE...

Purpose of the Analysis

As can be seen from the above variable list, there are both numerical and factor (categorical) background variables. The target variable of this analysis is the binary high/low alcohol consumption variable (high_use). To analyse it, I’ve chosen the following four variables that I assume to be indicative of students’ alcohol consumption:

  1. Absences from lessons (variable absences). I assume that students who skip lessons drink more.
  2. Going out (variable goout). I assume that students who go out more do so to drink, so they drink more.
  3. Final grade (variable G3). I assume that students with poor grades drink more.
  4. Study time (variable studytime). I assume that students who spend less time studying spend more time drinking.

A summary and some plots of the chosen variables are shown below (the boxplots’ whiskers extend to at most 1.5 times the interquartile range from the hinges, the ggplot2 default). I also grouped the box plots by sex to see any potential differences between the sexes:

summary(alc[c('absences','goout','G3','studytime')])
##     absences          goout             G3          studytime    
##  Min.   : 0.000   Min.   :1.000   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:2.000   1st Qu.: 8.00   1st Qu.:1.000  
##  Median : 3.000   Median :3.000   Median :11.00   Median :2.000  
##  Mean   : 5.319   Mean   :3.113   Mean   :10.39   Mean   :2.034  
##  3rd Qu.: 8.000   3rd Qu.:4.000   3rd Qu.:14.00   3rd Qu.:2.000  
##  Max.   :75.000   Max.   :5.000   Max.   :20.00   Max.   :4.000
p1 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + xlab('high use')
p2 <- ggplot(alc, aes(goout)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('going out')
p3 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab('final grade') + xlab('high use')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count")
# The multiplot() is defined in the init section of this file. For details, see
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
multiplot(p1, p2, p3, p4, cols = 2)

Logistic Regression Analysis

Based on the above exploration of the variables, it looks like all my previously stated hypotheses held at least to some extent, with the possible exception of the final grade. To confirm this, I did a logistic regression analysis using my chosen variables as the explanatory variables.

In the following, a model is fitted to the data and summarised.

m <- glm(high_use ~ absences + goout + G3 + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + G3 + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8166  -0.7495  -0.5034   0.8071   2.5083  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.62195    0.62727  -4.180 2.92e-05 ***
## absences     0.05600    0.01643   3.407 0.000656 ***
## goout        0.74536    0.12013   6.204 5.49e-10 ***
## G3           0.01269    0.02848   0.446 0.655854    
## studytime   -0.60004    0.16852  -3.561 0.000370 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 383.16  on 377  degrees of freedom
## AIC: 393.16
## 
## Number of Fisher Scoring iterations: 4

Judging by the model summary, my hypothesis about the final grade was wrong: it is not a statistically significant predictor of high alcohol consumption at all. All the other chosen explanatory variables are relatively strong predictors. Calculating the odds ratios points in the same direction:

or <- exp(coef(m))
or
## (Intercept)    absences       goout          G3   studytime 
##  0.07266094  1.05759534  2.10719756  1.01277150  0.54879019

As shown by the odds ratios, a student is virtually equally likely to consume a lot of alcohol regardless of the final grade. Interestingly, absences don’t matter much either: each additional absence multiplies the odds of high consumption by only about 1.06. With regard to going out and study time the results are, however, very clear: each step up the going-out scale roughly doubles the odds of drinking a lot, and each step up the study-time scale nearly halves them. But while absences don’t raise the odds much, comparing their odds ratio to its confidence interval still confirms their statistical significance:

ci <- exp(confint(m))
cbind(or, ci)
##                     or      2.5 %    97.5 %
## (Intercept) 0.07266094 0.02042166 0.2402599
## absences    1.05759534 1.02602413 1.0954317
## goout       2.10719756 1.67554867 2.6863511
## G3          1.01277150 0.95846241 1.0720167
## studytime   0.54879019 0.38995929 0.7564410

As shown above, the confidence intervals of the odds ratios of all the explanatory variables except the final grade (G3) steer well clear of one, which means their effects on the odds are very unlikely to be due to chance. With regard to the final grade, however, my initial hypothesis was clearly wrong, as the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it’s best to refit it without that variable:

m <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8376  -0.7530  -0.4927   0.8091   2.4967  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.47603    0.53168  -4.657 3.21e-06 ***
## absences     0.05627    0.01651   3.409 0.000651 ***
## goout        0.73711    0.11835   6.228 4.72e-10 ***
## studytime   -0.59422    0.16800  -3.537 0.000405 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 383.36  on 378  degrees of freedom
## AIC: 391.36
## 
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
##                             2.5 %    97.5 %
## (Intercept) 0.08407616 0.02874986 0.2322992
## absences    1.05788483 1.02622976 1.0958710
## goout       2.08988646 1.66701101 2.6539672
## studytime   0.55199236 0.39259400 0.7600378

As shown by the above statistics, the remaining explanatory variables are now all statistically highly significant and have confidence intervals well clear of the value one, so the model is now ready to be used for predictions.

Predicting with the Model

The model can be used for predictions as follows:

# Predict the probability.
probabilities <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = probabilities)
# Calculate a logical high use value based on the probabilities.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   246   24 270
##    TRUE     66   46 112
##    Sum     312   70 382
round(addmargins(prop.table(tbl)), 2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.64 0.06 0.71
##    TRUE   0.17 0.12 0.29
##    Sum    0.82 0.18 1.00

As the tables show, the model is too conservative in its predictions: it predicts fewer occurrences of high use than the actual data shows. This can also be demonstrated graphically:

hu <- as.data.frame(prop.table(table(alc$high_use)))
pred <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)

The actual model training error can be calculated as follows (note that this is a function only because one is needed later on for cv.glm()):

mloss <- function(obs, prob) {
  res <- ifelse(prob > 0.5, 1, 0)
  mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
## [1] 0.24

The training error is 24%, so the model’s accuracy is a little over 75%. This isn’t perfect, but likely still better than any simple guessing strategy – after all, guessing alone didn’t even get the statistical significance of my chosen variables right.
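The comparison to guessing can be made concrete: the simplest strategy is to always predict the majority class (no high use), and its error rate equals the share of actual high-use students, 112/382 ≈ 0.29 by the cross-tabulation above – clearly worse than the model’s 24%. A sketch, assuming alc and mloss() as defined above:

```r
# Error rate of always guessing FALSE: the proportion of actual high users.
baseline_error <- mean(alc$high_use)
round(baseline_error, 2)  # 0.29, per the cross-tabulation above
# Equivalently, via the loss function defined earlier (prob = 0 never predicts high use):
round(mloss(obs = alc$high_use, prob = 0), 2)
```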

Cross-validation (bonus task)

To test the model further, cross-validation can be performed. The following performs a 10-fold cross-validation:

# cv.glm() is from the boot package.
cv <- cv.glm(data = alc, cost = mloss, glmfit = m, K = 10)
cv$delta[1] # Print the average proportion of wrong predictions.
## [1] 0.2329843

The average prediction error of the 10-fold cross-validation is 0.23, which is already better performance than that of the model introduced in the DataCamp exercise. However, the model could be improved further by adding variables. In particular, this model does not include sex, which according to the DataCamp exercise is a statistically significant variable. Thus, we can try adding it and re-run the cross-validation to further improve the model:

m2 <- glm(high_use ~ absences + goout + studytime + sex, data = alc, family = "binomial")
summary(m2) # Summarise the model.
## 
## Call:
## glm(formula = high_use ~ absences + goout + studytime + sex, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7586  -0.7653  -0.4897   0.7254   2.5996  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.23486    0.59934  -5.397 6.76e-08 ***
## absences     0.06269    0.01687   3.716 0.000202 ***
## goout        0.74151    0.12061   6.148 7.86e-10 ***
## studytime   -0.45081    0.17155  -2.628 0.008594 ** 
## sexM         0.81167    0.26937   3.013 0.002585 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 374.08  on 377  degrees of freedom
## AIC: 384.08
## 
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m2)), exp(confint(m2))) # Compute the odds ratios and their confidence intervals.
##                            2.5 %    97.5 %
## (Intercept) 0.0393658 0.01172067 0.1236057
## absences    1.0646959 1.03217181 1.1038908
## goout       2.0990949 1.66738691 2.6783765
## studytime   0.6371102 0.45038109 0.8845879
## sexM        2.2516753 1.33375527 3.8439285
# Predict the probability.
probabilities2 <- predict(m2, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability2 = probabilities2)
# Calculate a logical high use value based on the probabilities.
alc <- mutate(alc, prediction2 = probability2 > 0.5)
# Re-run cross-validation and print the average number of wrong predictions.
cv2 <- cv.glm(data = alc, cost = mloss, glmfit = m2, K = 10)
cv2$delta[1]
## [1] 0.2120419

The average prediction error is now only 0.21, clearly lower than that of the previous model.


Classification of the Boston Dataset

Introduction

The data of this classification and clustering exercise was the Housing Values in Suburbs of Boston dataset, henceforth referred to simply as Boston. It is available in the R package MASS, which provides functions and datasets supporting Venables and Ripley, “Modern Applied Statistics with S” (4th edition, 2002). According to the reference manual of the package, the dataset includes 506 observations of the following 14 variables:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000 × (Bk − 0.63)², where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1,000s.

According to the reference manual, the data is based on the following sources:

  • Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
  • Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Overview of the Data

After the MASS package has been loaded – by calling library(MASS) – its built-in datasets can be prepared simply by passing their names to data(). This makes them accessible by name. They are, however, lazy-loaded, and thus only become actual data frames when they’re accessed for the first time (see help("data") for details):

# Prepare the data. (Quotes are optional but recommended; see help("data").)
data("Boston")

Glimpse the data to confirm it matches the reference manual:

# The data is only turned into an actual data frame at this point.
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, ...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, ...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, ...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 1...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 1...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 3...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15,...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 1...

Explore the data graphically:

# Subplot axis labels are partially too cramped, but I failed to find a working solution for that.
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p

Show variable summaries:

summary(Boston)
##       crim                zn             indus      
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
##  Median : 0.25651   Median :  0.00   Median : 9.69  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74  
##       chas              nox               rm             age        
##  Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
##  1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
##  Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
##  Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
##  3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
##  Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
##       dis              rad              tax           ptratio     
##  Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
##  Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
##      black            lstat            medv      
##  Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
##  1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
##  Median :391.44   Median :11.36   Median :21.20  
##  Mean   :356.67   Mean   :12.65   Mean   :22.53  
##  3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :396.90   Max.   :37.97   Max.   :50.00

As can be seen from the output, almost all of the variables are continuous, with the exception of the Charles River dummy variable (chas), which is binary, and the radial highways accessibility variable (rad), which is an index but still measured on an interval level.

The distributions of most variables are rather skewed, except for the dwelling size (number of rooms; rm), which is normally distributed, and the median value of owner-occupied homes (medv), which is nearly so. Some variables are very strongly skewed, such as the crime rate (crim), the proportion of land zoned for very large lots (zn) and the proportion of black population (black).
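The skewness claims can be quantified; a quick sketch using a hand-rolled moment-based sample skewness (defined here to avoid extra packages; values near zero indicate symmetry, large positive values a long right tail):

```r
# Moment-based sample skewness; near 0 for symmetric distributions.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sapply(Boston[c("crim", "zn", "black", "rm", "medv")], skewness), 2)
```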

There are a lot of variables in the data, so it might be helpful to calculate a correlation matrix of the data and visualise it:

# Calculate the correlation matrix and round it.
cor_matrix <- cor(Boston) %>% round(digits = 2)
# Visualise the correlation matrix; corrplot() is from the corrplot package.
corrplot(cor_matrix, method="circle", type="upper", tl.pos="d", tl.cex=0.6)

A numerical equivalent of a correlation plot including only the highest correlations might be created as follows. I personally found this to be the most intuitive way to find the highest correlations:

cb <- as.data.frame(cor(Boston)) # Create a DF of the correlation matrix.
cor_matrix_high <- as.data.frame(matrix(nrow = 14, ncol = 14)) #Copy
colnames(cor_matrix_high) <- colnames(cor_matrix) #the structure of
rownames(cor_matrix_high) <- rownames(cor_matrix) #cor_matrix.
cor_threshold <- 0.7
# Loop through the correlation matrix and save only values that exceed the threshold.
for(col in names(cb)) {
  for(row in 1:length(cb[[col]])) {
    if(abs(cb[[col,row]]) > cor_threshold & abs(cb[[col,row]]) < 1) { 
      cor_matrix_high[col,as.character(rownames(cb)[row])] <- round(cb[[col,row]], digits = 2)
    }
  }
}
# Print the matrix.
cor_matrix_high
##         crim zn indus chas   nox rm   age   dis  rad  tax ptratio
## crim      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## zn        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## indus     NA NA    NA   NA  0.76 NA    NA -0.71   NA 0.72      NA
## chas      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## nox       NA NA  0.76   NA    NA NA  0.73 -0.77   NA   NA      NA
## rm        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## age       NA NA    NA   NA  0.73 NA    NA -0.75   NA   NA      NA
## dis       NA NA -0.71   NA -0.77 NA -0.75    NA   NA   NA      NA
## rad       NA NA    NA   NA    NA NA    NA    NA   NA 0.91      NA
## tax       NA NA  0.72   NA    NA NA    NA    NA 0.91   NA      NA
## ptratio   NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## black     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## lstat     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## medv      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
##         black lstat  medv
## crim       NA    NA    NA
## zn         NA    NA    NA
## indus      NA    NA    NA
## chas       NA    NA    NA
## nox        NA    NA    NA
## rm         NA    NA    NA
## age        NA    NA    NA
## dis        NA    NA    NA
## rad        NA    NA    NA
## tax        NA    NA    NA
## ptratio    NA    NA    NA
## black      NA    NA    NA
## lstat      NA    NA -0.74
## medv       NA -0.74    NA

Variables tax (property taxes) and rad have a remarkably high correlation with each other: 0.91. This might mean that the taxation is at least partially based on the highway accessibility. Other relatively high correlations include e.g. the negative correlation between the amount of nitrogen oxides (nox) and the distance to employment centres (dis) (−0.77), the positive correlation between industry presence (indus) and nox (0.76), and the negative correlation between the proportion of pre-1940s buildings (age) and dis (−0.75).
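As an alternative to the loop above, base R’s which() with arr.ind = TRUE can extract the same high-correlation pairs more compactly (a sketch, assuming the Boston data is loaded):

```r
cm <- cor(Boston)
# Indices of correlations above the threshold (excluding the diagonal, where |cor| == 1).
idx <- which(abs(cm) > 0.7 & abs(cm) < 1, arr.ind = TRUE)
# Keep each pair only once (upper triangle).
idx <- idx[idx[, 1] < idx[, 2], , drop = FALSE]
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           cor  = round(cm[idx], 2))
```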

Standardising and Categorising the Data

To further analyse the dataset, the dataset must be standardised, i.e. every variable scaled so that its mean is zero and its standard deviation is one. This can be done as follows:

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515  
##       age               dis               rad         
##  Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We can now see from the output of summary() that this works as intended. We also need to categorise our target variable – crim – to classify it:

# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# Replace the scaled continuous variable with the categorical one.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
## 
##      low  med_low med_high     high 
##      127      126      126      127

Dividing the Data and Fitting the Model

To create the LDA model and to test it, the data has to be divided into training and testing sets. This can be done as follows, choosing randomly 80% of the data to be used for training:

n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n,  size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)

With the data divided, it is now possible to fit the LDA model on the training set:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2549505 0.2475248 0.2549505 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       1.0184911 -0.9045978 -0.11163110 -0.8803544  0.4609418
## med_low  -0.1115449 -0.2825412 -0.04298342 -0.5663564 -0.1342048
## med_high -0.3654691  0.1162216  0.12138096  0.3080398  0.1880336
## high     -0.4872402  1.0170891 -0.08120770  1.0403010 -0.4224060
##                 age        dis        rad        tax    ptratio
## low      -0.8797037  0.8998531 -0.6888949 -0.7590069 -0.4583341
## med_low  -0.3692625  0.3553846 -0.5581649 -0.4972214 -0.1090633
## med_high  0.3337648 -0.3404503 -0.3995984 -0.3146319 -0.2857040
## high      0.8237017 -0.8544174  1.6384176  1.5142626  0.7811136
##               black      lstat        medv
## low       0.3743885 -0.7735376  0.55716848
## med_low   0.3153310 -0.1373100  0.01247855
## med_high  0.1364794 -0.0900935  0.27064972
## high     -0.8639407  0.9305196 -0.70577202
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11479554  0.904438618 -0.86504028
## indus   -0.01660852 -0.185060511  0.27291681
## chas    -0.09728633 -0.040208581  0.08078549
## nox      0.48055410 -0.687486235 -1.41845418
## rm      -0.10429272 -0.101895589 -0.11529294
## age      0.19818757 -0.174917384 -0.20719617
## dis     -0.07036028 -0.307649720  0.13247805
## rad      3.14035711  1.148596993 -0.01097091
## tax      0.04428185 -0.360711423  0.58708293
## ptratio  0.12974038  0.045313224 -0.34570427
## black   -0.13571599 -0.008593115  0.11399544
## lstat    0.24089482 -0.300760007  0.36337417
## medv     0.19441278 -0.431145684 -0.29709474
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0346 0.0131

As we used quantiles to categorise the original variable, we’ve four classes. Thus, the output shows that we’ve three linear discriminants, as expected. Of these, the first explains the vast majority – about 95% – of the between-group variance.
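The proportion of trace reported above can be recomputed directly from the model’s singular values; a quick check, assuming the lda.fit object from the previous chunk:

```r
# The proportion of trace is each discriminant's share of the between-group
# variance: the squared singular values, normalised to sum to one.
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)
```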

The first two of the model’s linear discriminants can be visualised as follows. A helper function is needed to draw the arrows in the biplot:

# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.

It’s possible to visualise all three discriminants, but the lda.arrows() function is incompatible with that:

plot(lda.fit, dimen = 3, col = classes, pch = classes) # Plot.

Predicting with the Model

We’ve already prepared the test set above, so it’s now possible to move straight into predictions:

lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
addmargins(table(correct = correct_classes, predicted = lda.pred$class))
##           predicted
## correct    low med_low med_high high Sum
##   low       17      11        1    0  29
##   med_low    5      13        5    0  23
##   med_high   0       4       21    1  26
##   high       0       0        0   24  24
##   Sum       22      28       27   25 102

(I used addmargins() when tabulating, because in my opinion that’s more illustrative and helps comparisons.) As seen from the table, the model predicted the highest crime rates reliably, but the “med_low” category is overrepresented relative to the “low” and “med_high” categories. Thus, the model can be used to make crude predictions, but it’s hardly perfect. It might be better to use an unsupervised method and cluster the data instead of classifying it.
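The crude-but-imperfect impression from the cross-tabulation can be condensed into a single number; a minimal sketch, assuming the correct_classes and lda.pred objects from the prediction chunk above:

```r
# Overall accuracy: the proportion of test observations whose predicted
# class matches the true class.
mean(correct_classes == lda.pred$class)
```

In the tabulation above this corresponds to the diagonal (17 + 13 + 21 + 24 = 75) divided by the test-set size (102), i.e. roughly 0.74.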

Clustering the Data

To cluster the data, it needs to be loaded and standardised again, and a distance matrix created out of it:

boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000

We can try to cluster the data with k-means straight away. We used four classes for our LDA model, so we might start with the same number of clusters:

km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be calculated. Let’s try it, with a maximum of 15 clusters:

k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
  kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)

# Visualize the results.
plot(1:k_max, twcss, type='b')

The optimal number of clusters is where the TWCSS drops radically; however, by inspecting the above plot, it’s somewhat debatable whether this happens with just two or with four clusters. Thus, for comparison, let’s re-cluster the data with just two clusters:

km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

As the plots above demonstrate, there seems to be less overlap between the clusters than with four clusters, which suggests that, at least when using Euclidean distance, the optimal number of clusters is indeed two. Because it’s possible that different distance measures produce different results, I also briefly tested my code by creating the dist_eu variable using the Manhattan distance method instead, but found the results so similar to the Euclidean ones that they’re not worth repeating here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.
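The Manhattan variant mentioned above only requires changing the method argument of dist(); a minimal sketch, assuming the boston_scaled, k_max and k_try objects from the chunks above:

```r
# Re-run the TWCSS comparison with a Manhattan distance matrix.
dist_man <- dist(boston_scaled, method = "manhattan")
twcss_man <- sapply(1:k_max, function(k) kmeans(dist_man, k)$tot.withinss)
plot(1:k_max, twcss_man, type = 'b')
```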